[DEV-1434] Wire eval command to judge-facts cross-session memory #12
alexeyzimarev merged 2 commits into main
Conversation
Third leg of DEV-1434: the CLI now fetches retained judge facts per category at eval startup, injects them into each judge prompt as "known patterns", parses an optional `retain_fact` from each judge response, and POSTs novel facts back to the server for future runs.

- **Models.cs**: `JudgeFactPayload` (write) and `JudgeFact` (read), registered in the source-gen JSON context as `List<JudgeFact>` + `JudgeFactPayload`.
- **EvalCommand**: `FetchAllJudgeFactsAsync` loads all four categories at startup and stores them per-category for per-question injection; a fetch failure for any single category is logged and skipped (non-fatal). `PostJudgeFactAsync` runs after each judge when `retain_fact` is non-null and non-empty.
- **ExtractRetainFact** is a standalone parser that tolerates code fences and rejects null/undefined/empty/whitespace/non-string values — independent of `ParseVerdict`, so retained-fact plumbing doesn't regress if verdict parsing ever fails.
- **FormatKnownPatterns** renders the facts as a bulleted block, with an explicit "(none yet)" marker when empty so the prompt still reads naturally on a fresh system.
- The prompt template grew a "Known patterns" section and a `retain_fact` field in the response schema, with strict guidance on when to emit one (only generalizable patterns, never single observations).

10 new `EvalCommandTests` covering `FormatKnownPatterns` (empty + populated) and `ExtractRetainFact` (present, fenced, absent, null, empty, whitespace, non-string, malformed JSON). Full suite 205/205, AOT publish clean.

Depends on the server-side endpoints in #475 (kurrent-io/Kurrent.Capacitor).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
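The fence-tolerant parsing described above can be sketched roughly as follows. This is a hypothetical reconstruction from the description, not the PR's actual code (the real method lives in `EvalCommand.cs`):

```csharp
using System;
using System.Text.Json;

// Rough sketch of ExtractRetainFact per the description: strip an optional
// Markdown code fence, then accept "retain_fact" only when it is a non-empty,
// non-whitespace JSON string; anything else (absent, null, non-string,
// malformed JSON) yields null.
static string? ExtractRetainFact(string response) {
    var text = response.Trim();
    if (text.StartsWith("```")) {
        var firstNewline = text.IndexOf('\n');
        var lastFence = text.LastIndexOf("```", StringComparison.Ordinal);
        if (firstNewline >= 0 && lastFence > firstNewline)
            text = text[(firstNewline + 1)..lastFence].Trim();
    }
    try {
        using var doc = JsonDocument.Parse(text);
        if (doc.RootElement.ValueKind == JsonValueKind.Object &&
            doc.RootElement.TryGetProperty("retain_fact", out var fact) &&
            fact.ValueKind == JsonValueKind.String &&
            !string.IsNullOrWhiteSpace(fact.GetString()))
            return fact.GetString();
    } catch (JsonException) {
        // Malformed JSON: nothing to retain.
    }
    return null;
}

var fenced = "```json\n{\"retain_fact\": \"Prefers rebase over merge\"}\n```";
Console.WriteLine(ExtractRetainFact(fenced));                                   // Prefers rebase over merge
Console.WriteLine(ExtractRetainFact("{\"retain_fact\": 42}") ?? "(rejected)");  // (rejected)
```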
Replace .IsEqualTo(true) on nullable bool JSON reads with .IsTrue() (per TUnit analyzer suggestion). Coerce the nullable via `?? false` so the assertion documents "this must be true" rather than "this must equal true, or null is fine" — both match the original intent since the test setup writes enabledPlugins as `true`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review Summary by Qodo

Wire eval command to judge-facts cross-session memory
Walkthrough

Description
- Implement cross-session judge-facts memory for eval command: fetch retained facts per category at eval startup, inject facts into judge prompts as a "known patterns" section, parse optional `retain_fact` from judge responses, POST novel facts back to the server for future runs
- Add `FormatKnownPatterns` and `ExtractRetainFact` helper methods: format facts as a bulleted list with an empty-state marker; tolerant parsing of the `retain_fact` field (handles code fences, null, empty, malformed JSON)
- Extend prompt template with "Known patterns" section and `retain_fact` response field
- Add 10 new unit tests covering fact formatting and extraction edge cases
- Fix 4 pre-existing TUnitAssertions0015 warnings in SetupCommandTests

Diagram

```mermaid
flowchart LR
    A["Eval startup"] -->|FetchAllJudgeFactsAsync| B["Load facts per category"]
    B -->|FormatKnownPatterns| C["Inject into judge prompt"]
    D["Judge response"] -->|ExtractRetainFact| E["Parse retain_fact field"]
    E -->|PostJudgeFactAsync| F["POST to /api/judge-facts"]
    F -->|Future evals| A
```
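The empty-state marker mentioned in the walkthrough suggests a helper along these lines. A minimal sketch, not the PR's actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of FormatKnownPatterns: a bulleted block per fact, with
// an explicit "(none yet)" marker so a fresh system's prompt still reads well.
static string FormatKnownPatterns(IReadOnlyList<string> facts) =>
    facts.Count == 0
        ? "(none yet)"
        : string.Join("\n", facts.Select(f => $"- {f}"));

Console.WriteLine(FormatKnownPatterns(new List<string>()));
Console.WriteLine(FormatKnownPatterns(new List<string> { "Judge over-penalizes terse answers" }));
```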
File Changes
1. src/kapacitor/Commands/EvalCommand.cs
Code Review by Qodo
```csharp
// If the judge emitted a retain_fact, persist it for future evals.
if (ExtractRetainFact(result.Result) is { } retainedFact) {
    await PostJudgeFactAsync(httpClient, baseUrl, q.Category, retainedFact, context.SessionId, evalRunId);
}
```
1. Retain_fact skipped on failure 🐞 Bug ≡ Correctness
HandleEval only persists retain_fact after ParseVerdict succeeds, so any judge response that includes a usable retain_fact but fails verdict parsing is silently not retained. This contradicts ExtractRetainFact’s intent that retention plumbing “doesn't depend on verdict parsing succeeding.”
Agent Prompt
### Issue description
`retain_fact` persistence is currently gated on `ParseVerdict` succeeding. If the judge returns a response where `retain_fact` is present/valid but the verdict JSON is unparseable or fails validation (e.g., missing/out-of-range score), the loop `continue`s before calling `ExtractRetainFact`, so the fact is never retained.
### Issue Context
`ExtractRetainFact` is documented as independent of `ParseVerdict` so retention shouldn't depend on verdict parsing succeeding.
### Fix
Refactor the per-question loop to extract `retain_fact` regardless of verdict parse outcome, and only gate `verdicts.Add(...)` on `ParseVerdict`.
A typical structure:
- `var retainedFact = ExtractRetainFact(result.Result);`
- `var verdict = ParseVerdict(...);`
- if verdict != null -> add
- if retainedFact != null -> post
### Fix Focus Areas
- src/kapacitor/Commands/EvalCommand.cs[105-144]
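The suggested structure could look roughly like this. A self-contained sketch with stubbed `ExtractRetainFact`/`ParseVerdict` (simplified to plain JSON lookups here), not the PR's actual loop:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Stubbed versions of the two parsers, just to show the ordering.
static string? ExtractRetainFact(string json) {
    try {
        using var doc = JsonDocument.Parse(json);
        return doc.RootElement.TryGetProperty("retain_fact", out var f) &&
               f.ValueKind == JsonValueKind.String && !string.IsNullOrWhiteSpace(f.GetString())
            ? f.GetString() : null;
    } catch (JsonException) { return null; }
}

static int? ParseVerdict(string json) {
    try {
        using var doc = JsonDocument.Parse(json);
        return doc.RootElement.TryGetProperty("score", out var s) && s.ValueKind == JsonValueKind.Number
            ? s.GetInt32() : null;
    } catch (JsonException) { return null; }
}

// A response with a usable fact but no parseable verdict: the fact survives.
var response = "{\"retain_fact\": \"User force-pushes with uncommitted work\"}";
var retainedFact = ExtractRetainFact(response);   // extracted unconditionally
var verdict = ParseVerdict(response);             // null: no score field

var verdicts = new List<int>();
if (verdict is not null) verdicts.Add(verdict.Value);   // only scoring is gated
if (retainedFact is not null) Console.WriteLine($"retained: {retainedFact}");
```

The point of the ordering: neither extraction depends on the other, so a `continue` on a bad verdict can no longer drop a valid fact.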
```csharp
try {
    using var resp = await httpClient.GetWithRetryAsync($"{baseUrl}/api/judge-facts?category={category}");
    if (!resp.IsSuccessStatusCode) {
        Log($"Failed to fetch judge facts for {category}: HTTP {(int)resp.StatusCode}");
        continue;
    }

    var json = await resp.Content.ReadAsStringAsync();
    var list = JsonSerializer.Deserialize(json, KapacitorJsonContext.Default.ListJudgeFact) ?? [];
    result[category] = list;
    Log($"Loaded {list.Count} retained facts for category {category}");
} catch (HttpRequestException ex) {
    Log($"Could not load judge facts for {category}: {ex.Message}");
}
```
2. Judge-facts json aborts eval 🐞 Bug ☼ Reliability
FetchAllJudgeFactsAsync only catches HttpRequestException; malformed/non-matching JSON from /api/judge-facts can throw JsonException from JsonSerializer.Deserialize and crash eval startup. This violates the comment that fact-fetch failures don’t abort the run.
Agent Prompt
### Issue description
`FetchAllJudgeFactsAsync` can throw `JsonException` when `/api/judge-facts` returns malformed JSON (or an unexpected shape). The method currently only catches `HttpRequestException`, so this exception will bubble up and abort `HandleEval` before any questions run.
### Issue Context
The comment in `HandleEval` says failures fetching retained facts should not abort the run.
### Fix
In `FetchAllJudgeFactsAsync`, extend error handling to catch `JsonException` (and optionally `NotSupportedException`) around `JsonSerializer.Deserialize`, log an informative message, and `continue` to the next category.
### Fix Focus Areas
- src/kapacitor/Commands/EvalCommand.cs[93-99]
- src/kapacitor/Commands/EvalCommand.cs[247-269]
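A standalone sketch of why the wider catch matters (hypothetical helper name, and reflection-based `Deserialize` for brevity where the PR's code uses the source-gen context):

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Simulates the /api/judge-facts body coming back malformed: with
// JsonException handled the same way as HttpRequestException, the category is
// skipped instead of aborting eval startup.
static List<string> LoadFactsOrEmpty(string body, string category) {
    try {
        return JsonSerializer.Deserialize<List<string>>(body) ?? new List<string>();
    } catch (JsonException ex) {
        Console.WriteLine($"Could not load judge facts for {category}: {ex.Message}");
        return new List<string>();   // non-fatal: same as "no retained facts"
    }
}

var good = LoadFactsOrEmpty("[\"fact-a\",\"fact-b\"]", "git");
var bad  = LoadFactsOrEmpty("<html>502 Bad Gateway</html>", "git");
Console.WriteLine($"good={good.Count} bad={bad.Count}");   // good=2 bad=0
```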
Summary
Third leg of DEV-1434. The CLI now:
- fetches retained judge facts at eval startup (`GET /api/judge-facts?category=<cat>`, × 4) and injects them into each judge prompt as "known patterns";
- parses an optional `retain_fact` from each judge response and POSTs novel facts back to the server for future runs.

The prompt template grew a "Known patterns" section and a `retain_fact` field in the response schema, with strict guidance on when to emit one — only generalizable patterns ("User force-pushes with uncommitted work"), never single observations ("Ran rm -rf /tmp/cache"), to stop the retained-fact list from degenerating into noise.
Design notes
Also cleaned up: four pre-existing `TUnitAssertions0015` warnings in `SetupCommandTests` (`.IsEqualTo(true)` → `.IsTrue()`).
Test plan
🤖 Generated with Claude Code